Abstract. Social media content has grown exponentially in recent
years and the role of social media has evolved from just narrating life
events to actually shaping them. In this paper we explore how many 
resources shared in social media are still available on the live web or 
in public web archives. By analyzing six different event-centric datasets 
of resources shared in social media in the period from June 2009 to 
March 2012, we found about 11% lost and 20% archived after just a 
year and an average of 27% lost and 41% archived after two and a half 
years. Furthermore, we found a nearly linear relationship between time 
of sharing of the resource and the percentage lost, with a slightly less 
linear relationship between time of sharing and archiving coverage of 
the resource. From this model we conclude that after the first year of 
publishing, nearly 11% of shared resources will be lost and after that we 
will continue to lose 0.02% per day. 
1 Introduction

With more than 845 million Facebook users at the end of 2011 [5] and over 140
million tweets sent daily in 2011 [16], users can share photos and videos, post their
opinions, and report incidents as they happen. Many of the posts and tweets are 
about quotidian events and their preservation is debatable. However, some of 
the posts and events are about culturally important events whose preservation 
is less controversial. In this paper we shed light on the importance of archiving 
social media content about these events and estimate how much of this content 
is archived, still available, or lost with no possibility of recovery. 



To emphasize the culturally important commentary and sharing, we col- 
lected data about six events in the time period of June 2009 to March 2012: 
the H1N1 virus outbreak, Michael Jackson's death, the Iranian elections and 
protests, Barack Obama's Nobel Peace Prize, the Egyptian revolution, and the 
Syrian uprising. 
2 Related Work

To our knowledge, no prior study has analyzed the amount of shared resources 
in social media lost through time. There have been many studies analyzing the 
behavior of users within a social network, how they interact, and what content 
they share [3][19][20][23]. As for Twitter, Kwak et al. [6] studied its nature and
its topological characteristics and found a deviation from known characteristics
of human social networks that were analyzed by Newman and Park [10]. Lee
analyzed the reasons behind sharing news in social media and found that infor-
mativeness was the strongest motivation in predicting news sharing intention,
followed by socializing and status seeking [3]. Also, shared content in social media
like Twitter moves and diffuses relatively fast, as stated by Yang et al. [22].



Furthermore, many concerns were raised about the persistence of shared
resources and web content in general. Nelson and Allen studied the persistence
of objects in a digital library and found that, within just over a year, 3% of the
sample they collected appeared to be no longer available [9]. Sanderson et al.
analyzed the persistence and availability of web resources referenced from papers
in scholarly repositories using Memento and found that 28% of these resources
have been lost [14]. Memento [17] is a collection of HTTP extensions that enables
uniform, inter-archive access. Ainsworth et al. [1] examined how much of the
web is archived and found it ranges from 16% to 79%, depending on the starting
seed URIs. McCown et al. examined the factors affecting reconstructing websites
(using caches and archives) and found that PageRank, age, and the number of
hops from the top-level of the site were most influential [8].



3 Data Gathering 

We compiled a list of URIs that were shared in social media and correspond to
specific culturally important events. In this section we describe the data acqui- 
sition and sampling process we performed to extract six different datasets which 
will be tested and analyzed in the following sections. 



3.1 Stanford SNAP Project Dataset 

The Stanford Large Network Dataset is a collection of about 50 large network 
datasets having millions of nodes, edges and tuples. It was collected as a part
of the Stanford Network Analysis Platform (SNAP) project [15]. It includes
social networks, web graphs, road networks, Internet networks, citation networks, 
collaboration networks, and communication networks. For the purpose of our 
investigation, we selected their Twitter posts dataset. This dataset was collected 
from June 1st, 2009 to December 31st, 2009 and contains nearly 476 million 
tweets posted by nearly 17 million users. The dataset is estimated to cover 20%- 
30% of all posts published on Twitter during that time frame [21]. To select which
events will be covered in this study, we examined CNN's 2009 events timeline (see footnote 1).
We wanted to select a small number of events that were diverse, with limited
overlap, and relatively important to a large number of people. Given that, we
selected four events: the H1N1 virus outbreak, the Iranian protests and elections,
Michael Jackson's death, and Barack Obama's Nobel Peace Prize award.



Preparation: A tweet is typically composed of text, hashtags, embedded re-
sources or URIs, and usertags, all spanning a maximum of 140 characters. Here
is an example of a tweet record in the SNAP dataset:

T 2009-07-31 23:57:18
U http://twitter.com/nickgotch
W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin
http://is.gd/lWMZb WHOA..BETTER WATCH RT plz #pwa #tcot

The line starting with the letter T indicates the date and time of the tweet
creation, while the line starting with U shows a link to the user who au-
thored this particular tweet. Finally, the line starting with W shows the en-
tire tweet including all the user references ("@rockingjude"), the embedded URIs
("http://is.gd/lWMZb"), and hashtags ("#pwa #tcot").
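The T/U/W record layout described above can be split into its parts with a few lines of code. The following is a minimal sketch, not tooling from the SNAP project: the function name `parse_snap_record` and the regular expressions are assumptions made for illustration, and the W line is given here as a single line.

```python
import re
from datetime import datetime


def parse_snap_record(record: str) -> dict:
    """Parse one T/U/W tweet record (hypothetical helper, for illustration)."""
    fields = {}
    for line in record.strip().splitlines():
        parts = line.strip().split(None, 1)  # key, then the rest of the line
        if len(parts) == 2 and parts[0] in ("T", "U", "W"):
            fields[parts[0]] = parts[1].strip()
    text = fields.get("W", "")
    return {
        "created": datetime.strptime(fields["T"], "%Y-%m-%d %H:%M:%S"),
        "user": fields["U"],
        "text": text,
        # Simple patterns; real tweets may need more careful tokenization.
        "mentions": re.findall(r"@(\w+)", text),
        "uris": re.findall(r"https?://\S+", text),
        "hashtags": [tag.lower() for tag in re.findall(r"#(\w+)", text)],
    }


record = """T 2009-07-31 23:57:18
U http://twitter.com/nickgotch
W RT @rockingjude: December 21, 2009 Depopulation by Food Will Begin http://is.gd/lWMZb WHOA..BETTER WATCH RT plz #pwa #tcot"""

tweet = parse_snap_record(record)
print(tweet["created"], tweet["uris"], tweet["hashtags"])
```

A tweet whose `uris` list is empty carries no embedded resource and would be dropped by the filtration step described later.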



Tag Expansion: We wanted to select tweets that we can say with high confi- 
dence are about a selected event. In this case, precision is more important than 
recall as collecting every single tweet published about a certain event is less 
important than making sure that the selected tweets are definitely about that 
event. Several studies focused on estimating the aboutness of a certain web page
or a resource in general [12][18]. Fortunately, in Twitter, hashtags incorporated
within a tweet can help us estimate its "aboutness". Users normally add cer-
tain hashtags to their tweets to ease search and discoverability when following
a certain topic. These hashtags will be utilized in the event-centric filtration
process.

For each event, we selected initial tags that describe it (Table 1). Those initial
tags were derived empirically after examining some event-related tweets. Next
we extracted all the hashtags that co-occurred with our initial set of hashtags.
For example, in class H1N1 we extracted all the other hashtags that appeared
along with #h1n1 within the same tweet and kept count of their frequency.
Those extracted hashtags were sorted in descending order of the frequency of
their appearance in tweets. We removed all the general scope tags like #cnn,
#health, #death, #war and others. With regard to aboutness, removing general
tags will indeed decrease recall but will increase precision. Finally we picked the
top 8-10 hashtags to represent this event-class and be utilized in the filtration
process. Table 1 shows the final set of tags selected for each class.
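The tag expansion step amounts to counting which hashtags appear in the same tweet as the initial tag. A minimal sketch follows, assuming each tweet has already been reduced to a set of lowercase hashtags; the function name, the toy tweets and the `general_tags` parameter are illustrative assumptions, not the authors' code.

```python
from collections import Counter


def cooccurring_tags(tweets, initial_tag, general_tags=frozenset()):
    """Rank hashtags by how often they share a tweet with initial_tag."""
    counts = Counter()
    for tags in tweets:
        tags = {t.lower() for t in tags}
        if initial_tag in tags:
            # Skip the initial tag itself and any general-scope tags.
            counts.update(tags - {initial_tag} - set(general_tags))
    return counts.most_common()  # descending order of frequency


# Toy data, invented for illustration only.
toy_tweets = [
    {"h1n1", "swine", "flu"},
    {"h1n1", "swine", "health"},
    {"h1n1", "pandemic"},
    {"mj", "rip"},
]
ranked = cooccurring_tags(toy_tweets, "h1n1", general_tags={"health"})
print(ranked)
```

Keeping the top 8-10 entries of such a ranking would give the per-event tag set used for filtration.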



Event                 Initial Hashtags              Top Co-occurring Hashtags

H1N1 Outbreak         'h1n1' = 61,351               'swine' = 61,829, 'swineflu' = 56,419, 'flu' = 8,436,
                                                    'pandemic' = 6,839, 'influenza' = 1,725, 'grippe' = 1,559,
                                                    'tamiflu' = 331

M. Jackson's Death    'michaeljackson' = 22,934     'michael' = 27,075, 'mj' = 18,584, 'thisisit' = 8,770,
                                                    'rip' = 3,559, 'jacko' = 3,325, 'kingofpop' = 2,888,
                                                    'jackson' = 2,559, 'thriller' = 1,357, 'thankyoumichael' = 1,050

Iranian Elections     'iranelection' = 911,808      'iran' = 949,641, 'gr88' = 197,113, 'tehran' = 109,006,
                                                    'freeiran' = 13,378, 'neda' = 191,067, 'mousavi' = 16,587,
                                                    'united4iran' = 9,198, 'iranrevolution' = 7,295

Obama's Nobel Prize   'obama' = 48,161 &            'nobel' = 2,261, 'obamanobel' = 14, 'nobelprize' = [illegible],
                      'peace' = 3,721               'nobelpeace' = 113, 'barack' = 1,292, 'nobelpeaceprize' = 107

Table 1. Twitter hashtags generated for filtering and their frequency of occurrence

Tweet Filtration: In the previous step we extracted the tags that will help us
classify and filter tweets in the dataset according to each event. This filtration
process aims to extract a reasonably sized dataset of tweets for each event and to
minimize the inter-event overlap. Since the life and persistence of the tweet itself
is not the focus of this study but rather the associated resource that appears
in the tweet (image, video, shortened URI or other embedded resource), we
extract only the tweets that contain an embedded resource. This step resulted in
181 million tweets with embedded resources (http://is.gd/lWMZb in the prior
example). These tweets were further filtered to keep only the tweets that have
at least one of the expanded tags obtained from Table 1. The number of tweets
after this phase reached 1.1 million.

1 http://www.cnn.com/2009/US/12/16/year.timeline/index.html

Filtering the tweets based only on the occurrence of at least one of the hashtags
is undesirable as it causes two problems. First, it introduces possible
event overlap due to general tweets talking about two or more topics. Second,
using only the single occurrence of these tags yields a huge number
of tweets, and we need to reduce the dataset to a more manageable size. In-
tuitively speaking, strongly related hashtags will co-occur often. For example,
a tweet that has #h1n1 along with #swineflu and #pandemic is more likely
about the H1N1 outbreak than a tweet having just the tag #flu or just
#sick. Filtering with this co-occurrence solves both problems: by
increasing relevance to a particular event, general tweets that talk about several
events are filtered out, thus diminishing the overlap, and in turn the size of
the dataset is reduced.

Next, we increase the precision of the tweets associated with each event from
the set of 1.1 million tweets. In the first iteration we selected the tag that had the
highest frequency of co-occurrence in the dataset with the initial tag and added
it to a set we will call the selection set. After that we check the co-occurrence
of all the remaining extracted tags with the tag in the selection set and record
the frequencies of co-occurrence. After sorting the frequencies of co-occurrence
with the tag from the selection set, we pick the highest one and add it to
the selection set. We repeat this step of counting co-occurrences, each time against all the
hashtags already placed in the selection set in previous iterations.

To elaborate, for H1N1 assume that the hashtag '#h1n1' had the highest
frequency of appearance in the dataset, so we add it to the selection set. In the
next iteration we record how many times each tag in the list appeared along
with '#h1n1' in the same tweet. If '#swine' is the tag with the highest
frequency of co-occurrence with the initial tag '#h1n1', we add it to the selection set,
and in the next iteration we record the frequency of co-occurrence of the remaining
hashtags with both of the selected tags '#h1n1' and '#swine'. We repeat this
step, for each event, to the point where we have a dataset of manageable size
whose 'aboutness' in relation to the event we are confident in.
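The iterative selection described above can be read as a greedy loop: at each step, add the tag that co-occurs most often with every tag already in the selection set, and keep only the tweets that carry all selected tags. The sketch below is one possible reading, not the authors' implementation; `target_size` is a hypothetical stand-in for "a manageable size", and ties are broken alphabetically as an arbitrary choice.

```python
def iterative_filter(tweets, initial_tag, candidates, target_size):
    """Greedily grow a selection set of co-occurring hashtags.

    tweets      -- iterable of hashtag sets, one per tweet
    candidates  -- expanded tags for the event (e.g. a Table 1 row)
    target_size -- stop once this many tweets or fewer remain (assumption)
    """
    selection = [initial_tag]
    matching = [t for t in tweets if initial_tag in t]
    remaining = set(candidates) - {initial_tag}
    history = [(tuple(selection), len(matching))]
    while len(matching) > target_size and remaining:
        # Tweets in `matching` already contain every selected tag, so this
        # counts co-occurrence with the whole selection set.
        counts = {tag: sum(1 for t in matching if tag in t) for tag in remaining}
        best = max(sorted(counts), key=counts.get)
        if counts[best] == 0:
            break  # no remaining tag co-occurs with the selection set
        selection.append(best)
        remaining.discard(best)
        matching = [t for t in matching if best in t]
        history.append((tuple(selection), len(matching)))
    return selection, matching, history


# Toy data, invented for illustration only.
toy = [
    {"h1n1", "swine", "swineflu"},
    {"h1n1", "swine", "swineflu", "pandemic"},
    {"h1n1", "swine"},
    {"h1n1", "flu"},
    {"swine"},
]
selection, kept, history = iterative_filter(
    toy, "h1n1", ["swine", "swineflu", "pandemic"], target_size=1
)
for tags, size in history:
    print(" & ".join(tags), size)
```

Each `history` entry corresponds to one row of a filtration table: the tags required so far and the number of tweets that still match.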



Event    Hashtags selected for filtration               Tweets Extracted   Operation Performed   Final Tweets

MJ       michael                                        27,075
         michael & michaeljackson                       22,934             Sample 10%            2,293

Iran     iran                                           949,641
         iran & iranelection                            911,808
         iran & iranelection & gr88                     189,757
         iran & iranelection & gr88 & neda              91,815
         iran & iranelection & gr88 & neda & tehran     34,294             Sample 10%            3,429

H1N1     h1n1                                           61,351
         h1n1 & swine                                   44,972
         h1n1 & swine & swineflu                        42,574
         h1n1 & swine & swineflu & pandemic             5,517              Take All              5,517

Obama    obama                                          48,161
         obama & nobel                                  1,118              Take All              1,118

Table 2. Tweet filtration iterations and final tweet collections



Two problems arose from this approach with the Iran and Michael Jack-
son datasets. In the Iran dataset the number of tweets was in the hundreds of thou-
sands, and even with five co-occurring tags there were still more than 34K tweets. To
solve this we randomly sampled 10% of the resulting tweets,
yielding a smaller, manageable dataset. The second problem
arose with the Michael Jackson dataset: upon using five tags to reduce it to a manage-
able size, we realized there were few unique domains for the embedded resources.
A closer look revealed that this combination of tags matched mostly borderline tweet
spam (MJ ringtones). To solve this we used only the two top tags, "#michael"
and "#michaeljackson", and then randomly sampled 10% of the resulting
tweets to reach the desired dataset size (Table 2).



3.2 Egyptian Revolution Dataset 

The one-year anniversary of this event was the original motivation for this
study [13]. In this case, we started with an event and then tried to get so-
cial media content describing it. Despite its ubiquity, gathering social media for
a past event is surprisingly hard. We picked the Egyptian revolution due to the
role of social media in curating and driving the incidents that led to the
resignation of the president. Several initiatives were launched to collect and
curate the social media content during the revolution, like R-Shief.org (see footnote 2), which
specializes in social content analysis of the issues in the Arab world by using
aggregate data from Twitter and the Web. We are currently in the process of
obtaining the millions of records related to the Arab Spring of 2011. Meanwhile,
we decided to build our own dataset manually.

2 http://www.r-shief.org/
There are several sites that curate resources about the Egyptian Revolution
and we want to investigate as many of them as possible. At the same time, 
we need to diversify our resources and the types of digital artifacts that are 
embedded in them. Tweets, videos, images, embedded links, entire web pages 
and books were included in our investigation. For the sake of consistency, we 
limited our analysis to resources created within the period from the 20th of 
January 2011 to the 1st of March 2011. In the next subsections we explain each 
of the resources we utilized in our data acquisition in detail. 

Storify: Storify is a website that enables users to create stories by creating 
collections of URIs (e.g., Tweets, images, videos, links) and arrange them tem- 
porally. These entries are posted by reference to their host websites. Thus, adding 
content to Storify does not necessarily mean it is archived. If a user added a video 
from YouTube and after a while the publisher of that video decided to remove it 
from YouTube the user is left with a gap in their Storify entry. For this purpose 
we gathered all the Storify entries that were created between 20th of January 
2011 and the 1st of March 2011, resulting in 219 unique resources. 

IAmJan25: Some entire websites were dedicated as a collection hub of media 
to curate the revolution. Based on public contributions, those websites collect 
different types of media, classify them, order them chronologically and publish 
them to the public. We picked a website named IAmJan25.com, as an example 
of these websites, to analyze and investigate. The administrators of the website
received selected videos and images for notable events and actions that happened
during the revolution. Those images and videos were selected by users, who
vouched for their importance and sent each resource's URI to
the website administrators. The website itself is divided into two collections: a
video collection and an image collection. The video collection had 2387 unique 
URIs while the image collection had 3525 unique URIs. 

Tweets From Tahrir: Several books were published in 2011 documenting the 
revolution and the Arab Spring. To bridge the gap between books and digital 
media we analyzed a book entitled Tweets from Tahrir [11], which was pub-
lished on April 21st, 2011. As the name states, this book tells a story formed by
tweets of people during the revolution and the clashes with the past regime. We 
analyzed this book as a collection of tweets that had the luxury of a paperback 
preservation and focused on the tweeted media, in this case images. The book 
had a total of 1118 tweets having 23 unique images. 

3.3 Syria Dataset 

This dataset has been selected to represent a current (March 2012) event. Using
the Twitter search API, we followed the same pattern of data acquisition as
in Section 3.1. We started with one hashtag, #Syria, and expanded it. Table 3
shows the tags produced from the tag expansion step. After that, each of those
tags was input into a process utilizing the Twitter streaming API, which produced
the first 1000 results matching each tag. From this set, we randomly sampled
10%. As a result, 1955 tweets were extracted, each having one or more embedded
resources and tags from the expanded tags in Table 3.



Initial Hashtags    Extracted Hashtags

'Syria'             'Bashar', 'RiseDamascus', 'GenocideInSyria', 'STOPASSAD2012', 'AssadCrimes', 'Assad'

Table 3. Twitter hashtags generated for filtering the Syrian uprising

Table 4 shows the resources collected along with the top level domains that
those resources belong to for each event.

Event    Top Domains (number of resources found)

MJ       youtube (110), twitpic (45), latimes (43), cnn (30), amazon (30)
Iran     youtube (385), twitpic (36), blogspot (30), roozonline (29)
H1N1     rhizalabs (676), reuters (17), google (16), flutrackers (16), calgaryherald (11)
Obama    blogspot (16), nytimes (15), wordpress (12), youtube (11), cnn (10)
Egypt    youtube (2414), cloudfront (2303), yfrog (1255), twitpic (114), imageshack.us (20)
Syria    youtube (130), twitter (61), hostpic.biz (9), telegraph.co.uk (5)

Table 4. The top level domains found for each event, in descending order of the
number of resources.



4 Uniqueness and Existence 

From the previous data gathering step we obtained six different datasets related
to six different historic events. For each event we extracted a list of URIs that
were shared in tweets or uploaded to sites like Storify or IAmJan25. To answer
the question of how much of the social media content is missing, we first process the
URIs in each dataset to eliminate URI aliases, in which several URIs identify
the same resource. Upon obtaining those unique URIs we examine how many of
them are still available on the live web and how many are available in public
web archives.

4.1 Uniqueness 

Some URIs, especially those that appear in Twitter, may be aliases for the
same resource. For example, "http://bit.ly/2EEjBl" and "http://goo.gl/2ViC"
both resolve to "http://www.cnn.com". To solve this, we resolved all the URIs,
following redirects to the final URI. The HTTP response of the last redirect has
a header field called Location that contains the original long URI of the resource. This
step reduced the total number of URIs in the six datasets from 21,625 to 11,051.
Table 5 shows the number of unique resources in every dataset.
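Alias elimination can be sketched as following Location headers hop by hop and collecting the distinct final URIs. In the sketch below the network call is injected as a `fetch` function returning `(status, location)` so that the logic runs offline; `fetch`, `max_hops` and the lookup table are assumptions for illustration. In practice one option is a HEAD request with automatic redirects disabled (for example `requests.head(uri, allow_redirects=False)`).

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_final_uri(uri, fetch, max_hops=10):
    """Follow Location headers; return the final URI, or None on a loop."""
    seen = set()
    for _ in range(max_hops):
        if uri in seen:
            return None  # redirect loop
        seen.add(uri)
        status, location = fetch(uri)
        if status in REDIRECT_CODES and location:
            uri = urljoin(uri, location)  # Location may be relative
        else:
            return uri
    return None  # too many hops


def unique_final_uris(uris, fetch):
    """Collapse aliases: the set of distinct final URIs."""
    finals = (resolve_final_uri(u, fetch) for u in uris)
    return {f for f in finals if f is not None}


# Stand-in for the network: a lookup table invented for illustration.
responses = {
    "http://bit.ly/2EEjBl": (301, "http://www.cnn.com/"),
    "http://goo.gl/2ViC": (301, "http://www.cnn.com/"),
    "http://www.cnn.com/": (200, None),
    "http://loop.example/a": (302, "http://loop.example/b"),
    "http://loop.example/b": (302, "http://loop.example/a"),
}
fake_fetch = responses.__getitem__

unique = unique_final_uris(
    ["http://bit.ly/2EEjBl", "http://goo.gl/2ViC"], fake_fetch
)
print(unique)
```

Returning `None` for loops keeps such URIs out of the unique set here; whether they should instead be counted as missing resources is a separate decision.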

4.2 Existence on the Live Web

After obtaining the unique URIs from the previous step we resolve all of them and
classify them as Success or Failure. The Success class includes all the resources
that ultimately return a "200 OK" HTTP response. The Failure class includes
all the resources that return a "4XX" family response such as "404 Not Found",
"403 Forbidden" and "410 Gone", responses from the "30X" redirect family that end in an
infinite redirect loop, and server errors with a "50X" response. To avoid transient errors
we repeated the requests, on all datasets, several times over a week.

We also test for "soft 404s", which are pages that return a "200 OK" response
code but are not a representation of the resource, using a technique based on a
heuristic for automatically discovering soft 404s from Bar-Yossef et al. [2]. We
also count no response from the server, as well as DNS timeouts, as failures.
Note that failure means that the resource is missing on the live web. Table 5
summarizes, for each dataset, the number of resources missing from
the live web and the corresponding percentage of the total number
of unique resources.

                                 Available,       Available,       Missing,       Missing,        Total            Total
Event   All     Unique           archived         not archived     archived       not archived    missing          archived

MJ      2,293   1,187 = 51.77%   316 = 26.62%     474 = 39.93%     90 = 7.58%     307 = 25.86%    397 = 33.45%     406 = 34.20%
Iran    3,429   1,340 = 39.08%   415 = 30.97%     586 = 43.73%     101 = 7.54%    238 = 17.76%    339 = 25.30%     516 = 38.51%
H1N1    5,517   1,645 = 29.82%   595 = 36.17%     656 = 39.88%     98 = 5.96%     296 = 17.99%    394 = 23.95%     693 = 42.12%
Obama   1,118   370 = 33.09%     143 = 38.65%     135 = 36.49%     33 = 8.92%     59 = 15.95%     92 = 24.86%      176 = 47.57%
Egypt   7,313   6,154 = 84.15%   1,069 = 17.37%   4,440 = 72.15%   173 = 2.81%    472 = 7.67%     645 = 10.48%     1,242 = 20.18%
Syria   1,955   355 = 18.16%     19 = 5.35%       311 = 87.61%     0 = 0%         25 = 7.04%      25 = 7.04%       19 = 5.35%

Table 5. Percentages of unique resources from all the extracted ones we obtained per
event and the percentages of presence of those unique resources on live web and in
archives. Availability and archiving percentages are relative to each event's number of
unique resources. All resources = 21,625, Unique resources = 11,051
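The Success/Failure rules of Section 4.2 can be written down compactly. This is a sketch under stated assumptions: the function names are hypothetical, soft-404 detection is represented only by a boolean flag rather than by the heuristic of [2], and treating status codes the text does not mention as failures is a choice made here, not something the paper specifies.

```python
def classify_response(status, redirect_loop=False, soft_404=False):
    """Return "Success" or "Failure" for one resolved URI.

    status is the final HTTP status code, or None when the server gave no
    response or the DNS lookup timed out.
    """
    if status is None or redirect_loop or soft_404:
        return "Failure"
    if status == 200:
        return "Success"
    # 4XX and 50X are failures per the text; counting every other code as a
    # failure as well is an assumption of this sketch.
    return "Failure"


def missing_percentage(outcomes):
    """Share of resources classified as Failure, in percent."""
    missing = sum(1 for outcome in outcomes if outcome == "Failure")
    return 100.0 * missing / len(outcomes)


# MJ row of Table 5: 397 missing out of 1,187 unique resources.
mj_outcomes = ["Failure"] * 397 + ["Success"] * 790
print(round(missing_percentage(mj_outcomes), 2))
```

Repeating the requests over several days, as the text describes, would mean calling `classify_response` on the latest response and only keeping a Failure once it persists.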



4.3 Existence in the Archives 

In the previous step we tested the existence of the unique list of URIs for each
event on the live web. Next, we evaluate how many URIs have been archived
in public web archives. To check those archives we utilize the Memento frame-
work. If there is a memento for the URI, we download its memento timemap and
analyze it. The timemap is a datestamp-ordered list of all known archived ver-
sions (called "mementos") of a URI. Next, we parse this timemap and extract
the number of mementos that point to versions of the resource in the public
archives. We declare the resource to be archived if it has at least one memento.
This step was also repeated several times to avoid the transient states of the
archives before deeming a resource as unarchived. The results of this experiment
along with the archive coverage percentage are presented in Table 5.
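Counting mementos in a timemap can be illustrated on the link-format serialization, where each entry is a URI in angle brackets followed by attributes such as `rel` and `datetime`. The sketch below is an assumption-laden illustration: the sample timemap and the `archive.example` host are invented, the regular expressions cover only this simple layout, and the paper does not say how the authors parsed their timemaps.

```python
import re

# One link-format entry: <uri> followed by ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]+)>\s*((?:;\s*[\w-]+="[^"]*"\s*)*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def count_mementos(timemap_text):
    """Count entries whose rel attribute includes the token "memento"."""
    count = 0
    for match in LINK_RE.finditer(timemap_text):
        params = dict(PARAM_RE.findall(match.group(2)))
        # rel may hold several tokens, e.g. "first memento".
        if "memento" in params.get("rel", "").split():
            count += 1
    return count


def is_archived(timemap_text):
    """A resource counts as archived if it has at least one memento."""
    return count_mementos(timemap_text) >= 1


# Invented sample timemap, for illustration only.
sample_timemap = """<http://example.com/>; rel="original",
<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",
<http://archive.example/timegate/http://example.com/>; rel="timegate",
<http://archive.example/20100101000000/http://example.com/>; rel="first memento"; datetime="Fri, 01 Jan 2010 00:00:00 GMT",
<http://archive.example/20110101000000/http://example.com/>; rel="last memento"; datetime="Sat, 01 Jan 2011 00:00:00 GMT"
"""
print(count_mementos(sample_timemap), is_archived(sample_timemap))
```

Splitting `rel` into tokens keeps entries such as `rel="timemap"` or `rel="timegate"` from being counted as mementos.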

5 Existence as a Function of Time 

Inspecting the results from the previous steps suggests that the number of miss-
ing shared resources in social media corresponding to an event is directly propor-
tional to its age. To determine dates for each of the events we extracted
all the creation dates from all the tweet-based datasets and sorted them. For
each event, we plotted a graph illustrating the number of tweets per day related
to that event, as shown in Figure 1. Since the dataset is separated temporally into
3 partitions, and in order to display all the events on one graph, we reduced the
size of the x-axis by removing the time periods not covered in our study.




[Figure 1: plot of URIs shared per day; x-axis ticks: June 2009, August 2009, October 2009, December 2009, February 2011, February 2012]

Fig. 1. URIs shared per day corresponding to each event and showing the two peaks
in the non-Syrian and non-Egyptian events

Upon examining the graph we found an interesting phenomenon in the non-
Syrian and non-Egyptian events: each event has two peaks. Upon investigating
history timelines we came to the conclusion that those peaks reflect a second wave of
social media interaction as a result of a new incident within the same event after
a period of time. For example, in the H1N1 dataset, the first peak illustrates the
world-wide outbreak announcement while the second peak denotes the release
of the vaccine. In the Iran dataset, the first peak shows the peak of the elections
while the second peak pinpoints the Iranian trials. As for the MJ dataset, the
first peak corresponds to his death and the second peak describes the rumors
that Michael Jackson died of unnatural causes and a possible homicide. For
the Obama dataset, the first peak reveals the announcement of his winning the
prize while the second peak presents the award-giving ceremony in Oslo. For
the Egyptian revolution, the resources are all within a small time slot of 2 weeks
around the 11th of February. As for the Syrian event, since the collection was
very recent there were no obvious peaks. The peaks we examined become the
temporal centroids of the social content collections (the datasets): MJ (June
25th & July 10th 2009), Iran (June 13th & August 1st 2009), H1N1 (September
11th & October 5th 2009), and Obama (October 9th & December 10th 2009).
Egypt had a single centroid (February 11th 2011), and the Syria dataset also had one centroid,
on March 27th 2012. We split each of the two-peak events according to its two
centroids. Figure 1 shows those peaks and Table 6 shows the missing
content and the archived content percentages corresponding to each centroid.





               MJ                 Iran               H1N1               Obama              Egypt     Syria

% Missing      36.24%   31.62%    26.98%   24.47%    23.49%   25.64%    24.59%   26.15%    10.48%    7.04%
% Archived     39.45%   30.78%    43.08%   36.26%    41.65%   43.87%    47.87%   46.15%    20.18%    5.35%

Table 6. The Split Dataset



[Figure 2: Percentage Missing and Percentage Archived plotted against time, with one labeled point per event centroid, from the Iranian elections (13th Jun 2009) to the Syrian uprising (March 2012)]

Fig. 2. Percentage of content missing and archived for the events as a function of time.

Figure 2 shows the missing and archived values from Table 6 as a function of time
since shared. Equation 1 shows the modeled estimate for the percentage of shared
resources lost, where Age is in days. While there is a less linear relationship
between time and being archived, Equation 2 shows the modeled estimate for
the percentage of shared resources archived in a public archive.

Content Lost Percentage = 0.02 (Age in days) + 4.20     (1)

Content Archived Percentage = 0.04 (Age in days) + 6.74     (2)

Given these observations and our curve fitting we estimate that after a year from
publishing about 11% of content shared in social media will be gone. After this
point, we are losing roughly 0.02% of this content per day.
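As a worked example, Equations 1 and 2 can be evaluated directly for a given age. The function names below are illustrative; the coefficients are the ones stated in the two equations.

```python
def percent_lost(age_days):
    """Equation 1: estimated percentage of shared resources lost."""
    return 0.02 * age_days + 4.20


def percent_archived(age_days):
    """Equation 2: estimated percentage of shared resources archived."""
    return 0.04 * age_days + 6.74


for age in (365, 730, 912):  # one year, two years, about two and a half years
    print(f"{age} days: lost {percent_lost(age):.2f}%, "
          f"archived {percent_archived(age):.2f}%")
```

At 365 days Equation 1 gives 0.02 × 365 + 4.20 = 11.5, in line with the estimate of about 11% lost after the first year, and Equation 2 gives 0.04 × 365 + 6.74 = 21.34, close to the roughly 20% archived reported for that age.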
6 Conclusions and Future Work

We can conclude that there is a nearly linear relationship between time of shar- 
ing in the social media and the percentage lost. Although not as linear, there is 
a similar relationship between the time of sharing and the expected percentage 
of coverage in the archives. To reach this conclusion, we extracted collections of 
tweets and other social media content that was posted and shared in relation to 
six different events that occurred in the time period from June 2009 to March 
2012. Next we extracted the embedded resources within this social media content 
and tested their existence on the live web and in the archives. After analyzing 
the percentages lost and archived in relation to time and plotting them we used 
a linear regression model to fit those points. Finally we presented two linear 
models that can estimate the existence of a resource, that was posted or shared 
at one point of time in the social media, on the live web and in the archives as 
a function of age in the social media. 

In the next stage of our research we need to expand the datasets and import 
other similar datasets especially in the uncovered temporal areas (e.g., the year of 
2010 and before 2009). Examining more datasets across extended points in time 
could enable us to better model these two functions of time. Also, several other
factors besides time will be analyzed to understand their effect on persistence
on the live web and archiving coverage, such as publishing venue, rate of sharing,
popularity of authors and the nature of the related event.